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Abstract 

We introduce the Conditional Mutual Information (CMI) for the estimation of the 
Markov chain order. For a Markov chain of K symbols, we define CMI of order m, 
I c (m), as the mutual information of two variables in the chain being m time steps apart, 
conditioning on the intermediate variables of the chain. We find approximate analytic 
significance limits based on the estimation bias of CMI and develop a randomization 
significance test of I c (m), where the randomized symbol sequences are formed by ran- 
dom permutation of the components of the original symbol sequence. The significance 
test is applied for increasing m and the Markov chain order is estimated by the last 
order for which the null hypothesis is rejected. We present the appropriateness of 
CMI-testing on Monte Carlo simulations and compare it to the Akaike and Bayesian 
information criteria, the maximal fluctuation method (Peres-Shields estimator) and a 
likelihood ratio test for increasing orders using (^-divergence. The order criterion of 
CMI-testing turns out to be superior for orders larger than one, but its effectiveness for 
large orders depends on data availability. In view of the results from the simulations, 
we interpret the estimated orders by the CMI-testing and the other criteria on genes and 
intergenic regions of DNA chains. 

Keywords: order estimation, Markov chains, conditional mutual information (CMI), 
randomization test, DNA 
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1. Introduction 

Let {x,}^ =l denote a symbol sequence generated by a Markov chain {X,}, of an un- 
known order L > 1 in a discrete space of K possible states A = {a\, . . . ,0%}. The 
objective is to estimate L from the symbol sequence {x,}^ =l for a limited length N. 

Many criteria for Markov chain order estimation have been proposed and eval- 
uated in terms of their asymptotic properties. The Bayesian information criterion 
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(BIC) was proposed to render consistency of the popular Akaike information crite- 
rion (AIC) MM 

12211 . However, BIC was found to perform worse than AIC for small 



sequence lengths, questioning the value of asymptotic properties in practical problems 



[Kj|21|]. A more recent and general criterion than AIC and BIC is the efficient de- 
termination criterion (EDC), opting for a penalty function from a wide range of possible 
such functions 1I25I1 . Peres-Shields proposed in lfl9ll the maximal fluctuation method, 
which compares transition probabilities for words of increasing lengths, and Dalevi 
and Dubhashi 17J] modified it for practical settings and, instead of having to set a dif- 
ferent threshold for each problem, they estimate the order from a sharp change in the 
transition probabilities. They found that the Peres-Shields (PS) estimator is simpler, 
faster and more robust to noise than other criteria like AIC and BIC [7J]. Another 
method is that of global dependency level (GDL), also called relative entropy, using 
the /-divergence to measure the discrepancy between two probability distributions 0]. 
GDL was found consistent and more efficient than AIC and BIC on relatively small 



sequences. Finally, the method of Menendez et al [I_4|4l6fl makes likelihood ratio tests 
for increasing orders using the ^-divergence measures 111 711 . This procedure was found 
more powerful in tested cases than the existing chi-square and likelihood ratio proce- 
dures, and it has also been applied to DNA [ 16]. 

Here, we follow a different approach and estimate the Markov chain order from 
sequential hypothesis testing for the significance of the conditional mutual information 
(CMI) for increasing orders m, denoted as I c (m). I c (m) is the mutual information of 
X, and Xj +m conditioning on the intermediate variables of the chain, jcj+i, . . . ,JCf +m _i. A 
significant I c {m) indicates that the order of the Markov chain is at least m. Thus the 
repetition of the significance test of / c (m) for increasing m allows for the estimation 
of the Markov chain order L from the last order m for which the null hypothesis of 
zero CMI is rejected. We show that the significance bounds for / c (m) formed by means 
of appropriate resampling are more accurate than the approximate analytic bounds we 
derived based on previous analytic results on the bias of entropy 12011 . We further 
compare the CMI testing with other criteria for order selection on simulated Markov 
chains and DNA sequences. 

The structure of the paper is as follows. In Section |2] CMI is defined and esti- 
mated on symbol sequences, an analytic significance limit of CMI is derived, and a 
randomization significance test is proposed, forming our method of CMI-testing for 
the estimation of the Markov chain order. Other methods for estimating the Markov 
chain order are briefly presented. In Section [3] we assess the efficiency of the proposed 
CMI-testing and compare it to other order selection criteria on simulations of Markov 
chains produced by randomly chosen transition probability matrices of different order, 
as well as transition probability matrices estimated on genes and intergenic regions of 
DNA sequence. In SectionHl we apply the CMI testing to the two DNA sequences and 
investigate the limitations of order estimation in terms of data size. Finally, concluding 
remarks are discussed in Section [5] 

2. Conditional Mutual Information and Markov Chain Order Estimation 

First we define CMI in terms of mutual information and subsequently entropies. 
The Shannon entropy expresses the information (or uncertainty) of a random variable 
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Xt 

H(X) = -Y j p(x)lnp(x), 

X 

where the sum is defined for all possible symbols (discrete values) x e A, and p(x) is the 
probability of x occurring in the chain. The definition of Shannon entropy is extended 
to a vector variable X, = [X,,X t -\, . . . ,X f _ m+ i] from a stationary Markov chain {X,}, 
referred to as word of length m, and reads 



H(K t ) = - J] p(x,)\np(x,), 



where x, = {x t , x t -\ ,.. ., x t - m+ \ } e A'", p(x t ) is the probability of a word x, occurring in 
the chain, and the sum is over all possible words of K symbols and length m. 

The mutual information (MI) of two random variables in the Markov chain being 
m time steps apart, denoted 7(m) = I(X,; X,- m ), is defined in terms of entropy as J5| 

7(m) = H(X,) + H(X,- m ) - H(X„ X t . m ) = V p(x„ x t - m ) In p(x ' ,X '- m) (1) 

xfxT-,,, P(Xt)p{X,- m ) 

While 7(1) quantifies the amount of information X t -\ carries about X, and vice 
versa, 7(2) cannot be interpreted accordingly due to the presence of X t -\, and the in- 
formation of X,-2 about X,, or part of it, may already be shared with X t -i. Thus if we 
are after the genuine information of X,_2 about X,, we need to account for the infor- 
mation of about X,. This is indeed desired when we want to estimate the memory 
of the process, i.e. the order of the Markov chain. The appropriate measure for this 
is the conditional mutual information (CMI). CMI of order m is defined as the mutual 
information of X, and X,_,„ conditioning on X t - m +\ , . . . , J^t] 

I c (m) = I(X t ;X t - m \X t -\,...,X t - m +i) 

= I(X,; Xf-\, . . . , Xi-m) - I(X,;X t -i, . . . , X t - m+ \) 
= -H(X,, . . . , X t - m ) + H(X,-u . . ., X,- m ) 
+ H(X,, . . . ,Xf- m+ \) - H(X,-\, . . . ,X t - m+ i) 

p(x t , Xt- m ) In — -. (2) 

x„...,x, m p(Xt\x t -U...,X t - m +\) 

CMI coincides with MI for successive random variables in the chain, that is / c (l) = 
7(1). 



2.7. Estimation of Conditional Mutual Information 

The estimation of CMI is given through the estimation of the joint probability and 
the conditional probabilities in (O by the corresponding relative frequencies. Specifi- 
cally, the maximum likelihood estimate (MLE) of p(x,, x t -i, • ■ ■ , x,_ m+ i) is 

n h,-,i m 

p(x t , Xt-\, . . . , JC/- m +l) — i 
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where , m is the frequency of occurrence of a word . . . , i m } e A'" in the sym- 
bol sequence defined as «;,,...,;,„ = Z^l m I(-£/ = h, ■ ■ ■ , Xt-m+i — *m)> where I 
denotes the indicator function. Respectively, the MLE of the conditional probability 

p{x t \x t -i,...,Xt- m ) is 

p(x t \x t -\, . . . ,x,- m ) - . 

The estimate I c (m) of I c (m), by substituting the probability estimates in (f2]i, inherently 
suffers from the inefficiency of MLE at high dimensions being less accurate with the 
increase of m or K and the decrease of N, but also more biased. It has been proved 
that entropy estimation involves a negative bias, i.e. the estimated value is lower than 
the real one [13, 20ll. Consequently, MI estimation has positive bias which increases 



with K and m |Tj2(J, and thus the estimation of CMI has also positive bias, as CMI 
is the difference of two MI terms, where the arguments in the first MI term have jointly 
a dimension larger by one than that of the second MI term. Expressing the bias of 
CMI as the difference of the bias of two MI terms, indicates that the CMI estimate has 
lower bias than the bias for the respective Mi's, which has been shown for continuous 



variables in 02311 



An approximate expression for the bias of the entropy estimate of a random variable 
of K symbols from a sample of size N is given by I20I1 

H(X) - H(X) = -(K - 1)/(2A0. 

Noting that a word X, of length m is equivalent to a random variable X in K" 1 symbols, 
we can express in the same way the bias of the entropy estimate for a word X, from a 
symbol sequence of length N as 

H(X,) - H(X r ) = -(K m - 1)/(2A0. 

Substituting the expressions for the entropy bias in the definition of CMI in terms of 
entropies in (0, we find the following approximation for the bias of the CMI estimate 

I c (m) - 7 c (m) = K n, -\K - if/lN. (3) 

Note that the approximate bias for I{X,\Xt- m ) for any m is (K - 1) 2 /2N derived from 
([3]) for m - 1 . From ([3]i it can be seen that the bias of CMI increases when any of K 
and m increases and decreases. 

2.2. Randomization test for the significance of CMI 

We use CMI to estimate the order L of a Markov chain. The fundamental property 
of a Markov chain of order L is 

p(X,\X,^ , X,_ 2 , . . . , X t . L , X M , ...) = p(X t \X t . u X,_ 2 , . . . , X t . L ), 

meaning that the distribution of the variable X, of the Markov chain at time t is deter- 
mined in terms only of the preceding L variables of the chain. Thus for any lag order 
m < L, we expect in general two variables m time steps apart to be dependent given the 
m - 1 intermediate variables, and then I c (m) > 0. On the other hand, for m > L it must 
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be I c (m) = 0. Note that it is possible that I c (m) = for m < L, but not for m = L, as 
then the Markov chain order would not be L. So, increasing the order m, we expect in 
general when I c (m) > and I c (m + 1) = to have m — L. To account for complicated 
and rather unusual cases where I c (m + 1) = occurs for m + 1 < L, we can extend the 
condition I c (m) > and I c (m + 1) = to require also I c (m + 2) = 0, and even further 
up to some order m + k. 

The condition I c (m + 1) = for m-L does not hold exactly when estimating CMI 
from finite symbol sequences, and we always have I c (m + 1) > due to positive bias in 
the estimation of I c (m + 1). To address this, a significance test of I c {m) for increasing 
m has to be developed for the null hypothesis Ho : I c {m) = 0. In the absence of a 
rigorous analytic null distribution of the test statistic I c (m), we propose a randomization 
test using an ensemble of resampled (actually randomized as we preserve the marginal 
distribution) symbol sequences in order to form the empirical null distribution of I c (m). 
The test is one-sided with alternative hypothesis Hi : I c (m) > 0, as the estimation bias 
of I c {m) is positive. The randomization test is developed in the following steps. 

1. We generate M randomized symbol sequences [x* }f =l , ■ . ., {x* }v =l , by random 
permutation of the initial sequence \xt}^ =l . 

2. We compute t c (m) on the original symbol sequence, denoted /J?(m), and on the 
M randomized sequences, denoted I* (m), . . . , I* M {m). 

3. We reject Ho if I^(m) is at the right tail of the empirical null distribution formed 
by I* l (m), . . . , I* M {m). To assess this we use rank ordering, where r° is the rank 
of f^(m) in the ordered list of the M + 1 values, assuming ascending order. The 
p-value of the one-sided test is 1 - (r° - 0.326)/(M+ 1 + 0.348) (this correction 
for the empirical cumulative function is proposed in ll24ll ). 

The randomized sequences are by construction independent, but with the same marginal 
distribution as the original sequence, and therefore they are consistent with Ho. The es- 
timation of L with the proposed CMI-testing involves sequential implementation of the 
randomization significance test of I c {m) for increasing m, starting with m = 1, The 
repetitive procedure stops at an order m + 1 if no rejection of Ho is obtained and then 
L = m. To avoid premature termination of the sequential testing, which however can 
only be expected in special practical cases, the termination criterion may require that 
Ho is not rejected for more than one orders exceeding L = m. The termination criterion 
in the CMI-testing does not require a maximum order to be defined, which constitutes 
a free parameter for other order selection criteria [0 Uol . 

We illustrate the proposed CMI-testing with an example of the estimation of the 
order L — 4 of a Markov chain of K = 2 symbols defined by a randomly selected 
transition matrix. The CMI estimate I c (m) for m = 1 , . . . , 10 computed on three symbol 
sequences of length = 1000 generated by this Markov chain is shown in Figure QJ 
together with the approximate bias estimate for I c (m) = derived in (f3). For all three 
realizations, I c (m) tends to increase for m < L, then for m = L + 1 drops at the level 
of the approximate bias, and further increases for m > L but does not exceed the 
approximate bias of zero CMI. Though the approximate bias seems to discriminate 
significant I c (m) for m < L from insignificant I c (m) for m > L, it does not constitute an 
accurate upper bound of significance to be used as a criterion for the estimation of L. 
Note that in FigureQJ not all three I C (L + 1) are below the approximate bias value. The 
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use of randomized sequences turns out to provide more accurate significance limits. As 
shown in Figure QJ) for one of the three realizations, when m < L, I®{m) is larger than 
any I*'(m) for the M = 1000 randomized sequences, but it is smaller or within the range 
of I*'(m), i — 1, . . . ,M when m > L. The p-values of the tests shown in Figure QJ for 
the three realizations confirms the correct estimation of L — 4 using the CMI-testing, 
being less than the significance limit a = 0.05 for m < L and larger for m > L. In the 
next Section, these findings are established by means of Monte Carlo simulations and 
compared to other known order estimation criteria. 



2.3. Other criteria for Markov chain order estimation 

There are various Markov chain order estimators in the literature 
17, JjJ 21, 22, 25h, and we briefly discuss here the most prominent ones that we also 



consider in the comparative study . The first is the well-known Akaike's information 
criterion (AIC) (HQS 12211 . which uses the Kullback-Leibler information to define the 



likelihood ratio (LR) statistic of k-th order versus L-th order Markov chain 

N 

n KL = -2Y J fo{f(xi@k)/f{x i \hj), 

i=l 

where 9l is the unrestricted maximum likelihood estimate (MLE) of 6. The AIC func- 
tion is 

AIC(Jk) =n k>L - 2(K L+1 - K k+1 )(K - 1), 

and the estimated order is k — arg mino^<z. AlC(fc). 

Katz flO] applied the Bayesian information criterion (BIC) to the problem of Markov 
chain order estimation. Similarly to AIC, BIC is defined as 

BlC(fc) =n kfL - {K L+l - K k+l ){K - 1) In JV, 

and the order estimate is k — arg mino^z. BlC(fc). Though AIC is known to be incon- 
sistent and BIC consistent order estimate W, it was shown that BIC does not perform 
as well as AIC for small sample sizes fl llOll . 



The Peres-Shields estimator IU9I1 uses the so-called fluctuation function 



A K (u) = max 



aeA 



, , N x (r k (v)a) 
N x {T k (v)) 



where N x (v) denotes the frequency of occurrence of the word u of length I in {x ( }^j 
and Tk(v) denotes the A:-suffix of v, i.e. T^iv) = vl_ k v The initial expression of the 
Peres-Shields estimator is rather complicated and Dalevi and Dubhashi 01 proposed a 
simpler estimator, still close to the original Peres-Shields estimator, given as 

I = argmax(A*(y)/A* +1 (u)) 
and we use this estimator in the comparative study denoted as PS. 



Menendez et al jl4j] start with the observation that the LR test can be expressed in 



terms of the Kullback-Liebler divergence, which belongs to the class of the so-called 
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(^-divergence measures. Then they generalize LR for orders k and k + 1 for any 
divergence given as 

2 v-i tp(a k+ i\a\,...,a k )\ 



^(fli + i|a 2 , ■■■,a k )J 

where 0" is the second derivative of <p. For <p{x) = x log x - x + 1 we get the standard 



LR in terms of Kullback-Liebler divergence. Menendez et al [16] suggest using 0(x) = 
(A(A + l))~'(x' 1+I - x + A(l - x)) for A = 2/3, and they repeat LR test for increasing 
k order until no rejection is obtained, where S * follows the Chi-squared distribution 
with (K k+l - K k )(K - 1) degrees of freedom. We adopt this form of the test in the 
comparative study and denote it Sf. 



3. Monte Carlo Simulations 

We compare the CMI-testing to the approximate CMI bias estimate of (01, as well 
as other known criteria for the estimation of the Markov chain order L, and for this 
we use Monte Carlo simulations for varying parameters L, K and N. For each param- 
eter setting, we use 100 realizations and for the CMI-testing M = 1000 randomized 
sequences for each realization, and for all estimation methods the order is sought in 
the range m = 1, . . . , L + I. In the first simulation setup, Markov chains are derived 
by randomly set transition probability matrices of given order L, while in the second 
simulation setup Markov chains are derived by transition matrices of given order L, 
estimated on two DNA sequence of genes and intergenic regions. The results on the 
latter setting will give us the grounds for interpreting the results from the estimation of 
the Markov chain order on the DNA sequences in Sec. |4] 

3.1. Randomly selected transition probabilities 

First, we confirm the results about the illustrative example of Figure Q] using 100 
realizations. As m increases from 1 to L, I c (m) increases and lies over the approximate 
bias for I c (m) = defined in ||3}, as shown by the boxplots in Figured and indicated 
by the number of cases / c (m) exceeding the limit of the bias approximation for each 
m. However, when m = L + 1, I c (m) falls below this approximated bias for only 
about half of the 100 realizations, indicating that the approximate bias cannot establish 
the significance of I c {m). On the other hand, significance is well-established by the 
proposed CMI-testing, and the transition from significant to insignificant I c {m) at m = 
L can be safely detected. As shown in Figure |Z|), for all but one realization Ho : 
I C (L) = is rejected at the significance level a = 0.05, and only for 8 realizations Ho : 
I C (L+ 1) = is rejected at the same a. We note that for m = 1 the power of I c (m) is very 
low (rejection is obtained for only 57 cases), and improves as m increases towards L. 
The first reason is that in general the randomization test is conservative Hill , e.g. note 
that / c (l) is over the significance bias limit far more often than the significance limit 
drawn by the randomized sequences. The second and most important reason is specific 
to the simulation setup. The random selection of the transition matrix determines on 
average even dependence of one symbol in the chain to the L preceding symbols. Thus 
the knowledge of x t -\ contributes partially (by a factor of about 1/4 for our example 
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Table 1: Number of times the correct order L is estimated in 100 realizations by the 
criteria CMI, AIC, BIC, PS and Sf. The 100 symbol sequences of length N (being 
500, 1000 and 6000 as indicated in the first row) are generated by a Markov chain of 
K — 2 symbols with a randomly selected transition probability matrix of order L, for L 
varying as given in the second row. The best success rate for each L is highlighted. 



criterion 


N = 500 




N = 


1000 






N = 


6000 






L = 2 


L= 3 


L = 4 


L = 5 


L = 7 


L = 2 


L = 4 


L = 6 


L = 8 


L = 2 


L = 4 


L = 6 


L = 8 


CMI 


81 


92 


95 


94 


73 


87 


91 


98 


91 


93 


98 


95 


97 


AIC 


66 


62 


52 


38 


2 


75 


68 


33 


2 


77 


80 


60 


36 


BIC 


69 


56 


35 


1 





77 


59 


1 





88 


80 


59 





PS 


78 


72 


73 


63 


37 


88 


77 


71 


47 


97 


98 


99 


96 


Sf 


86 


60 


46 


28 





92 


42 


29 





97 


3 


1 


40 



with L — 4) to the total information about x t , and therefore I(x,\ x t -i) = / f (l) is small 
and can often be at the border of being statistically significant. The bias of I c (l) here is 
l/(2N) (see (f3]l for K = 2) and thus very close to zero, so that the distribution of I*' (I) 
from the randomized sequences is asymmetric and more likely to be broader to the right 
than for a larger bias. The information from x,_2 about x t is at the same level as for x f _i 
when accounting for x t -\, but now the bias of I c {2) is doubled and the distribution for 
the randomized sequences is more symmetric and less broader to the right, so that I c {2) 
may be at the right tail of the null distribution more often. This argument explains the 
increase of the percentage of rejections as m increases from one to L. 

We compare the CMI-testing, and refer to it simply as CMI, to four known cri- 
teria for the estimation of L: the Akaike's information criterion (AIC) 1(1 22], the 
Bayesian information criterion (BIC) JillilGlllII]], the criterion of Dalevi and Dubashi 
which is based on the Peres and Shield's estimator (PS) 0, H, and the criterion of 
Menendez et al (Sf) lfl5l[l6Tl . Table[T]presents the frequency of estimating correctly L 
for Markov chains of the first simulation setup, K -2,N - 500, 1000 and N = 6000. 
For L = 2, Sf scores highest with CMI being close behind, but for larger L the suc- 
cess rate of Sf decreases steadily while CMI estimates the correct order almost always, 
scoring much higher than all the other criteria. For the largest order L — 7 examined 
for N = 500, the success rate of CMI decreases, probably due to insufficient data size 
for such a large order, but the other criteria fail completely to estimate this order and 
only PS manages it for 37 of the realizations. All methods improve their performance 
when the sequence length increases to N = 1000 but at about the same degree so that 
the main differences persist. For larger L CMI maintains the highest success rate at a 
level over 90%, even for L = 8, while the other criteria fail, with Sf dropping again to 
the zero level for L = 8. AIC and BIC follow a similar decreasing success rate with L 
down to the zero level with BIC being worse, and PS attains higher success rates for 
larger L but still much lower than for CMI. When the chain length further increases 
(N = 6000), PS improves and performs as well as CMI. 

The estimation of L is more data demanding when there are more symbols K. As 
shown for K - 4 in Table |2] for N = 500, though CMI, PS and Sf succeed to identify 
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Table 2: Same as for Table[T| but for K — 4. 



criterion 




N = 


500 






N 


= 1000 






N = 


6000 






L = 2 


L = 3 


L = 4 


L = 5 


L = 2 


L = 3 


L = 4 


L = 5 


L = 6 


L = 2 


L = 4 


L - 5 


L = 6 


CMI 


100 


100 


18 


1 


98 


100 


96 


2 


3 


96 


100 


100 


5 


AIC 














2 














37 











BIC 





























32 











PS 


96 


81 


46 


23 


100 


89 


46 


21 


14 


100 


100 


61 


19 


Sf 


100 


72 








100 


100 











100 


100 


62 






the correct order for L — 2, 3 (with CMI scoring highest), all but PS fail for L > 3 with 
the score of PS falling slowest. For larger N (N = 1000 and N = 6000) the failure 
of the criteria occurs for larger L. AIC and BIC fail completely and only for L — 2 
and N = 6000 they have a success rate at about one third. Again CMI keeps the high 
success rate as L increases until it collapses due to lack of sufficient data, but so do the 
other criteria already for smaller L. For example, for L = 3 and N = 1000 both CMI 
and Sf score highest, but for L — 4 CMI still scores very high while Sf has dropped 
to zero score. Generally, Sf has the tendency to underestimate the order for larger L. 
Specifically, for the above simulation when L = 4, Sf estimates m = 1, m = 2, m = 3 at 
the rates 52%, 32%, 16% respectively. For larger L, PS tends to maintain some positive 
success rate when all other criteria fail completely (almost completely for CMI). 

The results of the simulation setup of randomly selected transition matrices showed 
that CMI overall outperforms the other criteria, whereas PS scores well for large L (at 
cases even higher than CMI), and Sf is best for very small L but scores poorly for larger 
L. AIC and BIC perform well for small number of symbols K, but their requirement 
for data size increases faster with K than for the other criteria. BIC tends to perform 
better than AIC for very small L, but this situation is reversed when L increases. PS 
and CMI keep the highest rate for larger orders over all settings. However, CMI stays 
ahead when the length of symbol sequences is smaller, while the success rate of PS 
seems to fall slower when the order becomes larger. 

3.2. Transition probabilities estimated on DNA 

DNA consists of four nucleotides, the two purines, adenine (A) and guanine (G), 
and the two pyrimidines, cytosine (C) and thymine (T), so DNA sequence can be con- 
sidered as a symbolic sequence on the symbols A,C,G,T. In our analysis we use a 
large segment of the Chromosome 1 of the plant Arabidopsis thaliana. We use two 
sequences, one joining together the genes, which contain non-coding regions, called 
introns, in between the coding regions, called exons, and another sequence joining to- 
gether the intergenic regions which have non-coding character. The sequences used 
here are segments of the long sequences used in 111 211 . 

For the second simulation setup we form the Markov chains from transition ma- 
trices of given order L estimated on the two DNA sequences of genes and intergenic 
regions, each of length = 6000. We make the simulations for K = 4 symbols (A, C, 
G, T) and K - 2 symbols (purines, pyrimidines). 
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Table 3: Same format of results as for TableQ] but for transition matrices of given order 
L estimated from a DNA sequence of genes of length N = 6000 in the form of purines 
and pyrimidines (K = 2) and all the four nucleotides (K = 4). 



criterion 




K 


= 2 






K 


= 4 






L = 2 


L = 3 


L = 4 


L = 5 


L = 2 


L= 3 


L = 4 


L = 5 


CMI 


50 


65 


15 


8 


93 


66 


35 





AIC 


63 


82 


15 


2 


87 











BIC 


4 























PS 


56 


58 


13 


9 


73 


42 


14 


8 


Sf 


50 


20 


1 





99 


19 









Table 4: Same as for Table [3] but for the DNA sequence of intergenic regions. 



criterion 






K 


= 2 








K 


= 4 






L = 2 


L = 4 


L 


= 5 


L = 6 


L = 7 


L = 2 


L = 3 


L = 4 


L = 5 


CMI 


81 


83 




50 


44 


16 


94 


66 


45 


8 


AIC 


85 


88 




27 


1 





79 











BIC 


42 




























PS 


81 


43 




27 


28 


14 


92 


28 


10 


21 


Sf 


85 


9 




1 








98 


23 


1 






The probability transition matrices of any order L estimated on the DNA sequences 
give more complicated structures of the Markov chains and make the estimation of 
L harder than when they are randomly selected. As shown in Table [3] for the gene 
sequence, all criteria score lower than for the respective orders L of the first simulation 
setup even for very small L, though we use quite large sequences (N = 6000). CMI 
is generally best for K — 4, e.g. for L — 3 CMI estimates the correct order for 2/3 of 
the realizations with PS being second best estimating correctly for 42 realizations, Sf 
for only 19, and AIC and BIC for none. However, AIC performs better than the other 
criteria when K — 2, being best for L — 2, L — 3 and L — 4 followed by PS and CMI 
(and Sf for L = 2). For both K — 2 and K — 4, as L increases the percentage of success 
rate falls sharply for the other criteria but more regularly for CMI and PS. 

For the intergenic regions the results are somehow better for all criteria. As shown 
in Table g] for K - 2 AIC scores highest for L = 2 and L = 4, with CMI following 
very closely (and Sf only for L = 2). As L increases the percentage of success rate falls 
sharply for the other criteria but more regularly for CMI and PS, while CMI scores 
highest maintaining a success rate at about 50% for L = 5 and L — 6. The same 
rapidly decreasing success rate for L > 2 holds for K — 4 and for all but CMI criteria. 
Nevertheless, CMI fails also to estimate the correct order for L > 6 and L > 4 when 
K - 2 and K = 4, respectively. The latter indicates the limit of orders that can be 
estimated with CMI for = 6000, so that if the real DNA sequence has larger order (or 
even infinite) this could not be estimated by CMI with such limited sequence. Generally, 
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AIC outperforms the other criteria for smaller orders and fewer symbols. However, 
CMI and PS score highest for larger L, while CMI performs better than PS for larger 
K. 

4. Application on DNA sequences 

In recent years, much of the statistical analysis of DNA sequences is focused on 
the estimation of properties of coding and non-coding regions as well as on the dis- 
crimination of these regions. There has been evidence that there is a different structure 
in coding and non-coding sequences and that the non-coding sequences tend to have 
long range correlation, whereas the correlation in coding sequences exhibits exponen- 
tial decay II ll 14, 1 1811 . Here we use intergenic and gene sequences. The latter is a mixture 
of coding regions (exons) and non-coding regions (introns), and therefore we expect 
to have also long correlation due to the non-coding regions in it, but it should be less 
than the correlation in the intergenic regions consisting only of non-coding parts. Thus 
both DNA sequences cannot be considered as Markov chains, at least not of a moderate 
order, and the estimation of the order L should increase with the data size. 

We estimate the order L of a hypothesized Markov Chain on Chromosome 1 of 
plant Arabidopsis thaliana by the CMI-testing and the other criteria. We make the 
computations for both genes and intergenic regions of length N = 10000 and N = 
100000 and for K — 2 (purines, pyrimidines), and the estimated orders from all criteria 
are shown in Figure [3] First we note that all criteria tend to estimate larger order as N 
increases, and for the same N they find larger order for the intergenic sequence, both 
features being in agreement with the discussion above. CMI establishes best these two 
features. The difference in the order of genes and intergenic regions holds for both 
N and their orders increase the most from 3 and 6 for N = 10000 to 9 and 12 for 
N = 100000, respectively. AIC estimates the same orders as CMI for N = 10000, 
but for N = 100000 only the estimated order for genes increases to 6 approaching the 
order for the intergenic region staying at about the same level. The other three criteria 
estimate smaller orders than CMI and AIC for N = 10000. Moreover, PS gives for 
N = 10000 the reverse pattern of the order for genes being 3 and for intergenic regions 
being 2, which changes to 3 and 9 for N = 100000, respectively. BIC and Sf give order 
estimates closer to the expected two features, but the order estimation is at a lower level 
than for CMI with Sf giving larger orders than BIC. CMI is the most consistent to the 
hypothesis of long range correlation in the intergenic sequence, and at a lesser degree 
to the gene sequence, as it provides the largest dependence of the order to the sequence 
length and maintains larger order for the intergenic sequence. 

5. Discussion 

In this work we propose the use of the measure of conditional mutual information 
(CMI) for the estimation of the order of Markov chain, in an analogous way the partial 
autocorrelation is used for the estimation of the order of an autoregressive model in 
time series 0]. Among others, a main difference is that the significance limits for 
partial autocorrelation are defined parametrically (under mild conditions), while for 
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CMI only approximate limits have been reported. Our simulations on analytic limits 
for the bias of CMI, which we have worked out, showed that they cannot provide 
accurate estimation of the Markov chain order L. Therefore we have built a scheme 
called CMI-testing, applying iteratively a randomization significance test for CMI, and 
the estimation of L is given by the largest order m for which CMI is found statistically 
significant. Thus CMI-testing does not implicate any maximum order, as for example 
the criteria of AIC and BIC. 

We compared CMI-testing to a number of other known order selection criteria using 
Monte Carlo simulations of Markov chains of varying order L and number of symbols 
K, and for different sequence lengths N. Randomization tests tend to be more conser- 
vative for small data sizes, but we found that CMI-testing could identify the correct L 
even at small sequences, e.g. for K - 2, N - 500 and L — 5 the success rate was 
94%. For larger K and L, and for smaller N, the accuracy of the estimation worsened, 
but still compared to the other criteria it was generally the highest. For small L, other 
criteria could score higher but CMI-testing always followed closely. 

The simulations showed the appropriateness of CMI-testing in the settings of non- 
trivial structures in the symbol sequences, involving high Markov chain order L. This 
was further confirmed by the simulations on Markov chains estimated on DNA se- 
quences, but also when applied, along with other criteria, to two real DNA sequences, 
one comprised of genes and the other of intergenic regions. Many reported works con- 
verge to that intergenic regions (consisting solely of non-coding DNA) have long range 
correlations, and genes (containing coding and non-coding DNA) have a mixture of 
short and long range correlations. As the estimation of CMI is computationally in- 
tensive, we made computations on DNA sequences up to the length N = 100000, for 
which CMI-testing gave the largest Markov chain orders 9 and 12 for the genes and in- 
tergenic sequences, respectively, being both higher than the orders obtained by any of 
the other criteria. This confirms the ability of CMI-testing in identifying large orders, 
as confirmed also in the simulations. 

To the best of our knowledge, this is the first work using CMI for the estimation of 
Markov chain order, and it certainly bears further improvement. For the randomization 
significance test we use randomly shuffled sequences irrespective of the order L, and 
we attribute to this lack of any dependence in the surrogate sequences the observation 
that for orders larger than the correct order L the original CMI is often at the lower 
tail of the distribution of the CMI on the surrogates. One possible improvement is to 
adjust the resampled sequences to the tested order, e.g. to be generated from Markov 
chains of order being one less than the tested order. However, the generation of such 
randomized sequences is not straightforward and it would additionally add to the heavy 
computational cost in CMI-testing. The latter is a disadvantage of CMI-testing in prob- 
lems were computation time may be an issue or when the sequence length is very large 
as for DNA. A parametric significance test would be a solution, which may come at 
the cost of reduced accuracy as, to the best of our knowledge, there is no exact analytic 
distribution of CMI. We currently work on this issue developing approximations for 
the CMI distribution. 
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Figure 1: (a) CMI vs order m for three realizations of a Markov chain determined by 
a randomly selected transition matrix (K = 2, L = A, N = 1000). Superimposed 
is the approximate bias estimate for zero CMI denoted with a dashed grey (online 
red) line, (b) CMI vs order m for one of the three realizations in (a) and for 1000 
randomly shuffled sequences, as indicated in the legend, (c) The p-value vs order of 
the randomization significance test for the three realizations in (a). The horizontal 
dashed grey (online red) line is for the significance level a = 0.05. 
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Figure 2: Distribution of I c (m) in (a) and /^-values in (b) against the order m presented 
as boxplots from 100 realizations of the Markov chain as in Figure Q] The number 
below each boxplot is for the realizations for which t c {m) is over the approximate 
significance bias (displayed by a grey (cyan online) dashed line) in (a), and the p- 
value of the randomization significance test is below the significance level a = 0.05 
(displayed by a dashed line) in (b). 
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Figure 3: The estimated order L of a Markov chain on Chromosome 1 of plant 
Arabidopsis thaliana (genes and intergenic regions) by the CMI-testing and the other 
criteria. The computations are made for number of symbols K — 2 and length 
N = 10000 in (a) and N = 100000 in (b). 
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